Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs
Abstract
Floating-point sparse matrix-vector multiplications (SM×V) are kernel operations in many scientific algorithms, where the SM×V often accounts for the largest part of the processing time. It is therefore important to speed up the SM×V. Using an FPGA for this is a logical choice, since FPGAs are inherently parallel. The core operation of the SM×V is to reduce arbitrarily many rows of values, each of arbitrary length, to a single value per row by summing all values within the row. This operation is called a reduction operation; the circuit that implements it is called a reduction circuit. Reduction operations can use any binary operator that is commutative and associative. In the case of an SM×V, this is a floating-point adder. Because the floating-point adder is pipelined, reductions become more complex: values need to be buffered and additional control logic is required. Furthermore, a proof is required to show that a certain buffer size is sufficient for every possible input. Important aspects of reduction circuits are thus buffer size, number of operators, latency, in-order output, area and clock speed. Many reduction circuit algorithms have been proposed in the literature. However, none of them meets the design criteria I use in this thesis. Most algorithms either require multiple operators or have buffer sizes that depend on the input. The algorithms without these restrictions have large buffers and deliver output out of order. In this thesis an algorithm is introduced that uses 5 simple rules to determine the order in which values have to be reduced using a single associative and commutative binary operator. The latency of the reduction circuit is fixed and equals 2α + α⌈log₂ α⌉ + 1 clock cycles; the buffer size is 2α + α⌈log₂ α⌉ + 1 for the output buffer and α + 1 for the input buffer. This is an improvement over the designs described in the literature.
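As a quick worked example, the latency and buffer-size expressions above can be evaluated for a given α. This is only a sketch: the abstract does not define α, so the code assumes it denotes the pipeline depth of the floating-point adder, and the function names and the value α = 14 are hypothetical, chosen purely for illustration.

```python
def ceil_log2(n: int) -> int:
    """Return ceil(log2(n)) for n >= 1 using exact integer arithmetic."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def reduction_circuit_bounds(alpha: int) -> dict:
    """Evaluate the abstract's expressions for a given alpha.

    Assumption: alpha is the pipeline depth of the adder (not stated
    in the abstract itself).
    """
    worst_case = 2 * alpha + alpha * ceil_log2(alpha) + 1
    return {
        "latency_cycles": worst_case,   # 2a + a*ceil(log2 a) + 1
        "output_buffer": worst_case,    # same expression as the latency
        "input_buffer": alpha + 1,      # a + 1
    }


# Hypothetical example value: alpha = 14.
bounds = reduction_circuit_bounds(14)
print(bounds)
```

For α = 14 this gives ⌈log₂ 14⌉ = 4, so the latency and output buffer come to 2·14 + 14·4 + 1 = 85 and the input buffer to 15.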
The buffer sizes and latency decrease as the minimum length of the input rows increases. The design is implemented on a Xilinx Virtex-4 4VLX160FF1513-10 FPGA (see appendix A). The complete design runs at 200 MHz and uses 3556 slices, 9 BlockRAMs and 3 DSP48 slices. With this reduction circuit, the SM×V implementation is straightforward: it requires a multiplier and a reduction circuit. Many such multiplier and reduction circuit pairs can be implemented in parallel. This yields so much processing power that I/O becomes the bottleneck.
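The multiplier-plus-reduction structure described above can be modelled in software roughly as follows. This is a minimal sketch, not the thesis design: it assumes the matrix is stored in CSR form (the abstract does not name a storage format), it does not model the adder pipeline, the buffers or the 5 scheduling rules, and all function names are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def multiply_stream(
    values: Sequence[float],
    col_idx: Sequence[int],
    row_ptr: Sequence[int],
    x: Sequence[float],
) -> Iterator[Tuple[int, float]]:
    """Multiplier stage: emit (row, product) pairs one at a time, row by row."""
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            yield row, values[k] * x[col_idx[k]]


def reduce_rows(products: Iterator[Tuple[int, float]], n_rows: int) -> List[float]:
    """Reduction stage: sum all products that belong to the same row."""
    y = [0.0] * n_rows
    for row, product in products:
        y[row] += product
    return y


def spmv(values, col_idx, row_ptr, x) -> List[float]:
    """Sparse matrix-vector product as a multiplier feeding a reduction."""
    n_rows = len(row_ptr) - 1
    return reduce_rows(multiply_stream(values, col_idx, row_ptr, x), n_rows)


# Example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form (assumed format).
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
print(spmv(values, col_idx, row_ptr, x))  # [7.0, 6.0, 19.0]
```

The software model adds the products of a row strictly in order. A pipelined hardware adder may combine them in a different order, which is why the reduction relies on the operator being treated as associative and commutative.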
Similar resources
Mapping Sparse Matrix-Vector Multiplication on FPGAs
Higher peak performance on Field Programmable Gate Arrays (FPGAs) than on microprocessors was shown for sparse matrix vector multiplication (SpMxV) accelerator designs. However due to the frequent memory movement in SpMxV, system performance is heavily affected by memory bandwidth and overheads in real applications. In this paper, we introduce an innovative SpMxV Solver, designed for FPGAs, SSF...
Sparse Matrix-Vector Multiplication on FPGAs
Floating-point Sparse Matrix-Vector Multiplication (SpMXV) is a key computational kernel in scientific and engineering applications. The poor data locality of sparse matrices significantly reduces the performance of SpMXV on general-purpose processors, which rely heavily on the cache hierarchy to achieve high performance. The abundant hardware resources on current FPGAs provide new opportunities to...
High performance sparse matrix-vector multiplication on FPGA
This paper presents the design and implementation of a high performance sparse matrix-vector multiplication (SpMV) on field-programmable gate array (FPGA). By proposing a new storage format to compress the indexes of non-zero elements by exploiting the substructure of the sparse matrix, our SpMV implementation on a reconfigurable computing platform with a multi-channel memory subsystem is capabl...
Sparse Matrix-Vector Multiplication for Circuit Simulation
Sparse Matrix-Vector Multiplication (SpMV) plays an important role in numerical algorithms for circuit simulation. In this report, we utilize Message Passing Interface (MPI) to parallelize the SpMV. In addition, as a result of the circuit simulation matrix formulation, the circuit systems are often represented as unstructured, unevenly distributed sparse matrices. Therefore, we automatically de...
Reconfigurable Sparse Matrix-Vector Multiplication on FPGAs
executing memory-intensive simulations, such as those required for sparse matrix-vector multiplication. This effect is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. An FPGA core designed for a target performance that does not unnecessarily exceed the memory imposed bottleneck can be distributed, along with multiple memory interfaces, into...
Publication date: 2008